Red Wine Quality Exploration by Leo Silva

Univariate Plots Section

## [1] 1599
## 'data.frame':    1599 obs. of  14 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
##  $ quality.factor      : Factor w/ 6 levels "3","4","5","6",..: 3 3 3 4 3 3 3 5 5 3 ...
##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality      quality.factor
##  Min.   : 8.40   Min.   :3.000   3: 10         
##  1st Qu.: 9.50   1st Qu.:5.000   4: 53         
##  Median :10.20   Median :6.000   5:681         
##  Mean   :10.42   Mean   :5.636   6:638         
##  3rd Qu.:11.10   3rd Qu.:6.000   7:199         
##  Max.   :14.90   Max.   :8.000   8: 18
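
Output like the block above is what nrow(), str() and summary() print. Here is a minimal sketch of those calls. The data frame name `wine` is an assumption, and instead of loading the full file (which would normally be done with read.csv()) it uses three columns from the first ten rows shown in the str() output as a stand-in.

```r
# Toy stand-in for the full dataset: first ten rows of three columns,
# copied from the str() output. The name `wine` is an assumption.
wine <- data.frame(
  pH      = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35),
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

n_rows <- nrow(wine)   # number of observations
print(n_rows)
str(wine)              # column types and first values
print(summary(wine))   # min, quartiles, mean and max per column
```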

I’ll start by plotting a histogram of the quality variable to check how it’s distributed.
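
A plot of this kind can be sketched in base R graphics (the original report may have used a different plotting package). The counts below are copied from the quality.factor column of the summary() output; with the raw data, barplot(table(wine$quality)) would give the same picture, where `wine` is an assumed name for the data frame.

```r
# Counts per quality score, copied from the summary() output above.
quality_counts <- c("3" = 10, "4" = 53, "5" = 681, "6" = 638, "7" = 199, "8" = 18)

# Bar chart of the counts, one bar per quality score.
barplot(quality_counts,
        xlab = "quality", ylab = "number of wines",
        main = "Distribution of wine quality")

total     <- sum(quality_counts)                          # all wines
share_5_6 <- sum(quality_counts[c("5", "6")]) / total     # share scored 5 or 6
print(total)
print(round(share_5_6, 3))
```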

Now that I have the histogram above, I’ll plot histograms of the other variables in the dataset to check which ones have a distribution that looks like it.
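
One way to draw a histogram per variable is to loop over the columns. This is a sketch only, using base R graphics and four columns from the first ten rows of the str() output as toy data (`wine` is an assumed name); ten rows are far too few to show the real shapes.

```r
# Toy data: first ten rows of four columns, copied from the str() output.
wine <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  residual.sugar   = c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2, 6.1),
  pH               = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35)
)

vars <- names(wine)
old_par <- par(mfrow = c(2, 2))        # arrange the plots in a 2 x 2 grid
for (v in vars) {
  hist(wine[[v]], main = v, xlab = v)  # one histogram per variable
}
par(old_par)                           # restore the previous layout
```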

From all the histograms above, I’d say that the variables most likely to have some effect on, or correlation with, the quality of the wine are fixed.acidity, volatile.acidity, pH and density. I’ll dig a bit more into this in the Bivariate Plots Section.

Univariate Analysis

Introduction

What is the structure of your dataset?

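This dataset has 1599 entries of the red Portuguese “Vinho Verde” wine, described by the 12 variables below. (The str() output above lists 14 because it also includes the row index X and the quality.factor variable I created.)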

1 - fixed acidity (tartaric acid - g / dm^3)
2 - volatile acidity (acetic acid - g / dm^3)
3 - citric acid (g / dm^3)
4 - residual sugar (g / dm^3)
5 - chlorides (sodium chloride - g / dm^3)
6 - free sulfur dioxide (mg / dm^3)
7 - total sulfur dioxide (mg / dm^3)
8 - density (g / cm^3)
9 - pH
10 - sulphates (potassium sulphate - g / dm^3)
11 - alcohol (% by volume)
Output variable (based on sensory data):
12 - quality (score between 0 and 10)

The quality variable is based on sensory data (the median of at least 3 evaluations made by wine experts). Each expert graded the wine quality between 0 (very bad) and 10 (very excellent). So quality is a qualitative (ordinal categorical) variable, even though it is stored as an integer. The other variables are the results of objective tests (e.g. pH values).

What is the main feature of interest in your dataset?

The main feature in this dataset is the quality of the wine, and I’m particularly interested in finding which variable(s) influence the quality of those wines the most.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

From the graphs I plotted above, I think that fixed.acidity, volatile.acidity, pH and density are good candidates for supporting this investigation.

Did you create any new variables from existing variables in the dataset?

Yes, I created the variable quality.factor, which is the quality variable cast to a factor. That may help if I want to plot some boxplots using the quality variable on the x axis.
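
The conversion can be sketched as below, using the first ten quality scores from the str() output (`wine` is an assumed name for the data frame). On the full data, factor(wine$quality) alone yields the six levels "3" to "8"; the toy sample only contains 5, 6 and 7, so the levels are passed explicitly here to match.

```r
# First ten quality scores, copied from the str() output.
wine <- data.frame(quality = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L))

# Cast the integer score to a factor. The explicit levels are only needed
# because this toy sample does not contain every score from 3 to 8.
wine$quality.factor <- factor(wine$quality, levels = 3:8)

print(levels(wine$quality.factor))       # the six quality levels
print(as.integer(wine$quality.factor))   # internal codes, as listed by str()
```

The internal codes printed at the end (3 3 3 4 3 3 3 5 5 3) are the same ones str() shows for quality.factor: they index the levels and are not the scores themselves.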

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

I didn’t see any unusual distributions, but I did see some variables with very skewed histograms, such as residual.sugar and chlorides.
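
The skew also shows up in the summary() numbers, without needing the plots. The sketch below copies the quartiles, means and maxima for residual.sugar and chlorides from that output, with pH added for comparison, and computes two rough indicators of my own choosing (they are not part of the original analysis): how far the maximum lies above the third quartile in units of the interquartile range, and how far the mean sits above the median.

```r
# Quartiles, means and maxima copied from the summary() output above.
stats <- data.frame(
  variable = c("residual.sugar", "chlorides", "pH"),
  q1       = c(1.9, 0.07, 3.21),
  median   = c(2.2, 0.079, 3.31),
  mean     = c(2.539, 0.08747, 3.311),
  q3       = c(2.6, 0.09, 3.40),
  max      = c(15.5, 0.611, 4.01)
)

# Distance of the maximum above the third quartile, in IQR units.
stats$upper_tail <- (stats$max - stats$q3) / (stats$q3 - stats$q1)

# Relative gap between mean and median; a clearly positive value hints at right skew.
stats$mean_gap <- (stats$mean - stats$median) / stats$median

print(stats[, c("variable", "upper_tail", "mean_gap")])
```

Both indicators are much larger for residual.sugar and chlorides than for pH, which is consistent with the long right tails in their histograms.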

Bivariate Plots Section

Below I’m going to plot scatter plots with regression lines fitted by a linear model and, to complement them, boxplots using quality.factor.

The objective here is to see how each of the variables in the dataset relates to quality.
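
For one variable, the pair of plots described above can be sketched as follows. This uses base R graphics and the first ten rows of alcohol and quality from the str() output as toy data (`wine` is an assumed name); the original plots may have been drawn with another package, and ten rows only demonstrate the mechanics.

```r
# Toy data: first ten rows of alcohol and quality, copied from the str() output.
wine <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)
wine$quality.factor <- factor(wine$quality)

# Scatter plot of quality against alcohol with a least-squares line on top.
fit <- lm(quality ~ alcohol, data = wine)
plot(wine$alcohol, wine$quality,
     xlab = "alcohol (% by volume)", ylab = "quality")
abline(fit, col = "red")

# Boxplot of alcohol for each quality level.
boxplot(alcohol ~ quality.factor, data = wine,
        xlab = "quality", ylab = "alcohol (% by volume)")

print(coef(fit))   # intercept and slope of the fitted line
```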

Bivariate Analysis

Analyzing the plots above, I noticed that quality does not correlate with most of the other variables: fixed.acidity, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide and pH.

The plots of volatile.acidity, citric.acid, density, sulphates and alcohol show that they do correlate with quality, with alcohol being the variable that correlates with quality the most.
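
The correlation of each variable with quality can be computed with cor(). The sketch below runs on the first ten rows from the str() output (`wine` is an assumed name), so the numbers it prints are not the correlations of the full dataset. Spearman's rank correlation is included next to Pearson's as my own addition, since quality is an ordinal score.

```r
# Toy data: first ten rows of a few columns, copied from the str() output.
wine <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  citric.acid      = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  sulphates        = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

predictors <- setdiff(names(wine), "quality")

# Correlation of every predictor with quality, linear and rank-based.
pearson  <- sapply(wine[predictors], cor, y = wine$quality)
spearman <- sapply(wine[predictors], cor, y = wine$quality, method = "spearman")

print(round(cbind(pearson, spearman), 2))
```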

Multivariate Plots Section

Multivariate Analysis

I decided to draw scatter plots of the relationship between quality, alcohol and the other variables with strong relationships with quality: volatile.acidity, citric.acid, density, sulphates.

And isn’t it beautiful how those plots show exactly how they relate with each other? :)

They confirm the correlation numbers; some plots are denser, others are more spread out (e.g. citric.acid). Some have more outliers than others, of course, but they confirm the correlations listed in the Bivariate Analysis section and show graphically how the variables relate to each other and to quality.
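
A plot of this kind can be sketched by putting alcohol on one axis, the second variable on the other and colouring the points by quality. The version below uses base R graphics and the first ten rows from the str() output as toy data (`wine` is an assumed name); the colour scheme is arbitrary and the original plots may have been built differently.

```r
# Toy data: first ten rows of three columns, copied from the str() output.
wine <- data.frame(
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  quality          = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)
wine$quality.factor <- factor(wine$quality)

lv         <- levels(wine$quality.factor)
point_cols <- as.integer(wine$quality.factor)   # one colour index per quality level

# Two variables on the axes, quality encoded as colour.
plot(wine$alcohol, wine$volatile.acidity, col = point_cols, pch = 19,
     xlab = "alcohol (% by volume)", ylab = "volatile.acidity (g / dm^3)")
legend("topright", legend = lv, col = seq_along(lv), pch = 19, title = "quality")
```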

Final Plots and Summary

Plot One

Description One

Even though the quality variable is stored as a number in the dataset, by definition it is an ordinal categorical variable. Strictly speaking, this distribution therefore cannot be called normal, and correlation coefficients between quality and the other variables should be read with caution (a rank-based measure such as Spearman’s is better suited to ordinal data). But we can say that most of the wines are of quality 5 and 6, while 3 and 8 (the worst and best wines respectively) have the lowest counts.

Plot Two

Description Two

Those plots show that there’s a strong relationship between alcohol and quality. All three views (the scatterplot, the linear regression line and the boxplots) tell the same story: the higher the % of alcohol in the wine, the better the perceived quality.

Plot Three

Description Three

Those 4 scatter plots show how alcohol, quality and 4 other variables relate to each other. Example: a red wine with low % of alcohol and low volatile.acidity is far more likely to be evaluated as low quality than a wine that has high % of alcohol and high volatile.acidity.


Reflection

I’ve found that those 5 variables (alcohol, volatile.acidity, citric.acid, density and sulphates) have a strong correlation with the wine’s quality score. To take this analysis a step further, I would try to create a regression model that uses those 5 variables to predict the wine’s quality.
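
A first version of such a model could be a plain linear regression with lm(). This is a sketch under assumptions, not a finished model: it treats the ordinal quality score as a number, and it runs on the first ten rows from the str() output (`wine` is an assumed name). The density column is left out of the toy fit because the str() output shows only five of its values; on the full data the formula would also include density.

```r
# Toy data: first ten rows, copied from the str() output (density omitted,
# since only five of its values are visible there).
wine <- data.frame(
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  citric.acid      = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  sulphates        = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8),
  quality          = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Linear model of quality on the selected variables.
fit <- lm(quality ~ alcohol + volatile.acidity + citric.acid + sulphates,
          data = wine)

print(coef(fit))                           # one coefficient per term plus the intercept
predicted <- predict(fit, newdata = wine)  # predicted quality for each wine
print(round(predicted, 2))
```

With the full 1599 rows, summary(fit) would additionally report the standard errors and R-squared needed to judge how well the five variables explain quality.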